Abstract

This article discusses a latent variable model for inference and prediction of symmetric
relational data. The model, based on the idea of the eigenvalue decomposition, represents the
relationship between two nodes as the weighted inner-product of node-specific vectors of latent
characteristics. This "eigenmodel" generalizes other popular latent variable models, such as
latent class and distance models: It is shown mathematically that any latent class or distance
model has a representation as an eigenmodel, but not vice-versa. The practical implications of
this are examined in the context of three real datasets, for which the eigenmodel's out-of-sample
predictive performance is as good as or better than that of the other two models.
1 Introduction
Let $\{y_{i,j} : 1 \le i < j \le n\}$ denote data measured on pairs of a set of $n$ objects or nodes. The
examples considered in this article include friendships among people, associations among words
and interactions among proteins. Such measurements are often represented by a sociomatrix $Y$,
which is a symmetric $n \times n$ matrix with an undefined diagonal. One of the goals of relational data
analysis is to describe the variation among the entries of $Y$, as well as any potential covariation of
$Y$ with observed explanatory variables $X = \{x_{i,j}, 1 \le i < j \le n\}$.

To this end, a variety of statistical models have been developed that describe $y_{i,j}$ as some
function of node-specific latent variables $u_i$ and $u_j$ and a linear predictor $\beta^T x_{i,j}$. In such
formulations, $\{u_1, \ldots, u_n\}$ represent across-node variation in the $y_{i,j}$'s and $\beta$ represents
covariation of the $y_{i,j}$'s with the $x_{i,j}$'s. For example, Nowicki and Snijders [2001] present a model
in which each node $i$ is assumed to belong to an unobserved latent class $u_i$, and a probability
distribution describes the relationships between each pair of classes (see Kemp et al. [2004] and
Airoldi et al. [2005] for recent extensions of this approach). Such a model captures stochastic
equivalence, a type of pattern often seen in network data in which the nodes can be divided into
groups such that members of the same group have similar patterns of relationships.

Figure 1: Networks exhibiting homophily (left panel) and stochastic equivalence (right panel).

An alternative approach to representing across-node variation is based on the idea of homophily,
in which the relationships between nodes with similar characteristics are stronger than the
relationships between nodes having different characteristics. Homophily provides an explanation for
data patterns often seen in social networks, such as transitivity ("a friend of a friend is a friend"),
balance ("the enemy of my friend is an enemy") and the existence of cohesive subgroups of nodes.
In order to represent such patterns, Hoff et al. [2002] present a model in which the conditional
mean of $y_{i,j}$ is a function of $\beta^T x_{i,j} - |u_i - u_j|$, where $\{u_1, \ldots, u_n\}$ are vectors of unobserved,
latent characteristics in a Euclidean space. In the context of binary relational data, such a model
predicts the existence of more transitive triples, or "triangles," than would be seen under a random
allocation of edges among pairs of nodes. An important assumption of this model is that two nodes
with a strong relationship between them are also similar to each other in terms of how they relate
to other nodes: A strong relationship between $i$ and $j$ suggests $|u_i - u_j|$ is small, but this further
implies that $|u_i - u_k| \approx |u_j - u_k|$, and so nodes $i$ and $j$ are assumed to have similar relationships
to other nodes.

The latent class model of Nowicki and Snijders [2001] and the latent distance model of
Hoff et al. [2002] are able to identify, respectively, classes of nodes with similar roles, and the
locational properties of the nodes. These two items are perhaps the two primary features of interest
in social network and relational data analysis. For example, discussion of these concepts makes up
more than half of the 734 pages of main text in Wasserman and Faust [1994]. However, a model that can
represent one feature may not be able to represent the other: Consider the two graphs in Figure 1.
The graph on the left displays a large degree of transitivity, and can be well-represented by
the latent distance model with a set of vectors $\{u_1, \ldots, u_n\}$ in two-dimensional space, in which the
probability of an edge between $i$ and $j$ is decreasing in $|u_i - u_j|$. In contrast, representation of
the graph by a latent class model would require a large number of classes, none of which would
be particularly cohesive or distinguishable from the others. The second panel of Figure 1 displays
a network involving three classes of stochastically equivalent nodes, two of which (say A and B)
have only across-class ties, and one (C) that has both within- and across-class ties. This graph is
well-represented by a latent class model in which edges occur with high probability between pairs
having one member in each of A and B or in B and C, and among pairs having both members in
C (in models of stochastic equivalence, nodes within each class are not differentiated). In contrast,
representation of this type of graph with a latent distance model would require the dimension of
the latent characteristics to be on the order of the class membership sizes.

Many real networks exhibit combinations of structural equivalence and homophily in varying
degrees. In these situations, use of either the latent class or distance model would only be
representing part of the network structure. The goal of this paper is to show that a simple statistical
model based on the eigenvalue decomposition can generalize the latent class and distance models:
Just as any symmetric matrix can be approximated with a subset of its largest eigenvalues and
corresponding eigenvectors, the variation in a sociomatrix can be represented by modeling $y_{i,j}$ as a
function of $\beta^T x_{i,j} + u_i^T \Lambda u_j$, where $\{u_1, \ldots, u_n\}$ are node-specific factors and $\Lambda$ is a diagonal
matrix. In this article, we show mathematically and by example how this eigenmodel can represent both
stochastic equivalence and homophily in symmetric relational data, and thus is more general than
the other two latent variable models.

The next section motivates the use of latent variable models for relational data, and shows
mathematically that the eigenmodel generalizes the latent class and distance models in the sense
that it can compactly represent the same network features as these other models but not
vice-versa. Section 3 compares the out-of-sample predictive performance of these three models on three
different datasets: a social network of 12th graders; a relational dataset on word association counts
from the first chapter of Genesis; and a dataset on protein-protein interactions. The first two
networks exhibit latent homophily and stochastic equivalence respectively, whereas the third shows
both to some degree. In support of the theoretical results of Section 2, the latent distance and
class models perform well for the first and second datasets respectively, whereas the eigenmodel
performs well for all three. Section 4 summarizes the results and discusses some extensions.



2 Latent variable modeling of relational data 

2.1 Justification of latent variable modeling 

The use of probabilistic latent variable models for the representation of relational data can be
motivated in a natural way: For undirected data without covariate information, symmetry suggests
that any probability model we consider should treat the nodes as being exchangeable, so that

$$\Pr(\{y_{i,j} : 1 \le i < j \le n\} \in A) = \Pr(\{y_{\pi i, \pi j} : 1 \le i < j \le n\} \in A)$$

for any permutation $\pi$ of the integers $\{1, \ldots, n\}$ and any set of sociomatrices $A$. Results of
Hoover [1982] and Aldous [1985, chap. 14] show that if a model satisfies the above exchangeability
condition for each integer $n$, then it can be written as a latent variable model of the form

$$y_{i,j} = h(\mu, u_i, u_j, \epsilon_{i,j}) \tag{1}$$

for i.i.d. latent variables $\{u_1, \ldots, u_n\}$, i.i.d. pair-specific effects $\{\epsilon_{i,j} : 1 \le i < j \le n\}$ and some
function $h$ that is symmetric in its second and third arguments. This result is very general - it says
that any statistical model for a sociomatrix in which the nodes are exchangeable can be written as
a latent variable model.

Different choices of $h$ lead to different models for $y$. A general probit model for binary network
data can be put in the form of (1) as follows:

$$\{\epsilon_{i,j} : 1 \le i < j \le n\} \sim \text{i.i.d. normal}(0, 1)$$
$$\{u_1, \ldots, u_n\} \sim \text{i.i.d. } f(u \mid \psi)$$
$$y_{i,j} = h(\mu, u_i, u_j, \epsilon_{i,j}) = \delta_{(0,\infty)}(\mu + \alpha(u_i, u_j) + \epsilon_{i,j}),$$

where $\mu$ and $\psi$ are parameters to be estimated, and $\alpha$ is a symmetric function, also potentially
involving parameters to be estimated. Covariation between $Y$ and an array of predictor variables
$X$ can be represented by adding a linear predictor $\beta^T x_{i,j}$ to $\mu$. Finally, integrating over $\epsilon_{i,j}$ we
obtain $\Pr(y_{i,j} = 1 \mid x_{i,j}, u_i, u_j) = \Phi[\mu + \beta^T x_{i,j} + \alpha(u_i, u_j)]$. Since the $\epsilon_{i,j}$'s can be assumed to be
independent, the conditional probability of $Y$ given $X$ and $\{u_1, \ldots, u_n\}$ can be expressed as

$$\Pr(y_{i,j} = 1 \mid x_{i,j}, u_i, u_j) \equiv \theta_{i,j} = \Phi[\mu + \beta^T x_{i,j} + \alpha(u_i, u_j)] \tag{2}$$

$$\Pr(Y \mid X, u_1, \ldots, u_n) = \prod_{i<j} \theta_{i,j}^{y_{i,j}} (1 - \theta_{i,j})^{1 - y_{i,j}}$$
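As a rough illustration of equation (2) and the product likelihood, the following Python sketch evaluates $\theta_{i,j}$ and the log-likelihood for a toy three-node network without covariates. The function names and the values of `alpha` and `Y` are hypothetical and chosen only for illustration; they are not taken from the paper or its software.

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit_edge_probs(mu, alpha):
    """theta[i, j] = Phi(mu + alpha[i, j]) for a symmetric array alpha (no covariates)."""
    return np.vectorize(norm_cdf)(mu + alpha)

def probit_loglik(Y, theta):
    """Log of prod_{i<j} theta^y (1 - theta)^(1 - y), using only pairs with i < j."""
    iu = np.triu_indices(Y.shape[0], k=1)   # the diagonal of a sociomatrix is undefined
    y, t = Y[iu], theta[iu]
    return float(np.sum(y * np.log(t) + (1 - y) * np.log(1 - t)))

# Toy example with hypothetical values of alpha(u_i, u_j); diagonal entries are ignored.
alpha = np.array([[0.0, 1.0, -1.0],
                  [1.0, 0.0, 0.5],
                  [-1.0, 0.5, 0.0]])
Y = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
theta = probit_edge_probs(mu=0.0, alpha=alpha)
loglik = probit_loglik(Y, theta)
```

Working on the log scale avoids underflow when the product runs over many pairs.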

Many relational datasets have ordinal, non-binary measurements (for example, the word association
data in Section 3.4). Rather than "thresholding" the data to force it to be binary, we can make
use of the full information in the data with an ordered probit version of (2):

$$\Pr(y_{i,j} = y \mid x_{i,j}, u_i, u_j) \equiv \theta_{i,j}^{(y)} = \Phi[\mu_y + \beta^T x_{i,j} + \alpha(u_i, u_j)] - \Phi[\mu_{y+1} + \beta^T x_{i,j} + \alpha(u_i, u_j)]$$

$$\Pr(Y \mid X, u_1, \ldots, u_n) = \prod_{i<j} \theta_{i,j}^{(y_{i,j})},$$

where $\{\mu_y\}$ are parameters to be estimated for all but the lowest value $y$ in the sample space.

2.2 Effects of nodal variation 

The latent variable models described in the Introduction correspond to different choices for the
symmetric function $\alpha$:

Latent class model:

$\alpha(u_i, u_j) = m_{u_i, u_j}$

$u_i \in \{1, \ldots, K\}$, $i \in \{1, \ldots, n\}$

$M$ a $K \times K$ symmetric matrix

Latent distance model:

$\alpha(u_i, u_j) = -|u_i - u_j|$

$u_i \in \mathbb{R}^K$, $i \in \{1, \ldots, n\}$

Latent eigenmodel:

$\alpha(u_i, u_j) = u_i^T \Lambda u_j$

$u_i \in \mathbb{R}^K$, $i \in \{1, \ldots, n\}$

$\Lambda$ a $K \times K$ diagonal matrix.
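The three choices of $\alpha$ can be written as short NumPy functions. This is a minimal sketch under a few assumptions: class labels are coded 0 to K-1 rather than 1 to K, $|\cdot|$ is taken to be the Euclidean norm, and the function names and example values are hypothetical.

```python
import numpy as np

def alpha_class(u, M):
    """Latent class: alpha(u_i, u_j) = m_{u_i, u_j}; u holds class labels 0..K-1."""
    return M[np.ix_(u, u)]

def alpha_distance(U):
    """Latent distance: alpha(u_i, u_j) = -|u_i - u_j|; U is an n x K array."""
    diff = U[:, None, :] - U[None, :, :]
    return -np.sqrt((diff ** 2).sum(axis=2))

def alpha_eigen(U, lam):
    """Latent eigenmodel: alpha(u_i, u_j) = u_i^T Lambda u_j; lam is the diagonal of Lambda."""
    return (U * lam) @ U.T

# Hypothetical inputs for n = 3 nodes and K = 2.
u = np.array([0, 0, 1])
M = np.array([[-1.0, 2.0],
              [2.0, 0.5]])
U = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 1.0]])
lam = np.array([1.0, -1.0])

A_class = alpha_class(u, M)
A_dist = alpha_distance(U)
A_eig = alpha_eigen(U, lam)
```

Each function returns a full symmetric n x n array; only the off-diagonal entries correspond to entries of a sociomatrix.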

Interpretations of the latent class and distance models were given in the Introduction. An
interpretation of the latent eigenmodel is that each node $i$ has a vector of unobserved characteristics
$u_i = \{u_{i,1}, \ldots, u_{i,K}\}$, and that similar values of $u_{i,k}$ and $u_{j,k}$ will contribute positively or negatively
to the relationship between $i$ and $j$, depending on whether $\lambda_k > 0$ or $\lambda_k < 0$. In this way, the model
can represent both positive or negative homophily in varying degrees, and stochastically equivalent
nodes (nodes with the same or similar latent vectors) may or may not have strong relationships
with one another.

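We now show that the eigenmodel generalizes the latent class and distance models: Let $S_n$ be
the set of $n \times n$ sociomatrices, and let

$\mathcal{C}_K = \{C \in S_n : c_{i,j} = m_{u_i, u_j},\ u_i \in \{1, \ldots, K\},\ M \text{ a } K \times K \text{ symmetric matrix}\}$;
$\mathcal{D}_K = \{D \in S_n : d_{i,j} = -|u_i - u_j|,\ u_i \in \mathbb{R}^K\}$;
$\mathcal{E}_K = \{E \in S_n : e_{i,j} = u_i^T \Lambda u_j,\ u_i \in \mathbb{R}^K,\ \Lambda \text{ a } K \times K \text{ diagonal matrix}\}$.

In other words, $\mathcal{C}_K$ is the set of possible values of $\{\alpha(u_i, u_j), 1 \le i < j \le n\}$ under a $K$-dimensional
latent class model, and similarly for $\mathcal{D}_K$ and $\mathcal{E}_K$.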



$\mathcal{E}_K$ generalizes $\mathcal{C}_K$: Let $C \in \mathcal{C}_K$ and let $\tilde{C}$ be a completion of $C$ obtained by setting $c_{i,i} = m_{u_i, u_i}$.
There are at most $K$ unique rows of $\tilde{C}$ and so $\tilde{C}$ is of rank $K$ at most. Since the set $\mathcal{E}_K$ contains
all sociomatrices that can be completed as a rank-$K$ matrix, we have $\mathcal{C}_K \subseteq \mathcal{E}_K$. Since $\mathcal{E}_K$ includes
matrices with $n$ unique rows, $\mathcal{C}_K \subset \mathcal{E}_K$ unless $K \ge n$ in which case the two sets are equal.
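The rank argument can be checked numerically. The sketch below builds a completed latent class matrix from hypothetical labels and a hypothetical $M$ (with 0-based labels), then recovers node-specific vectors and a diagonal $\Lambda$ from its eigendecomposition. It illustrates the containment for one example and is not a proof.

```python
import numpy as np

K, n = 3, 8
u = np.array([0, 1, 2, 0, 1, 2, 2, 1])           # hypothetical class labels (0-based)
M = np.array([[ 0.5, -1.0,  2.0],
              [-1.0,  1.5,  0.0],
              [ 2.0,  0.0, -0.7]])               # hypothetical symmetric K x K matrix

C = M[np.ix_(u, u)]                               # completion: c_ij = m_{u_i,u_j}, including i == j

# Eigendecomposition of the symmetric completion; keep the K eigenvalues largest in magnitude.
vals, vecs = np.linalg.eigh(C)
keep = np.argsort(-np.abs(vals))[:K]
lam, U = vals[keep], vecs[:, keep]                # diagonal of Lambda and n x K latent vectors

E = (U * lam) @ U.T                               # e_ij = u_i^T Lambda u_j
rank_C = np.linalg.matrix_rank(C)
max_err = np.abs(E - C).max()
```

In this example the retained eigenvalues include a negative one, so the representation relies on $\Lambda$ being free to take negative diagonal entries.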

$\mathcal{E}_{K+1}$ weakly generalizes $\mathcal{D}_K$: Let $D \in \mathcal{D}_K$. Such a (negative) distance matrix will generally
be of full rank, in which case it cannot be represented exactly by an $E \in \mathcal{E}_K$ for $K < n$. However,
what is critical from a modeling perspective is whether or not the order of the entries of each
$D$ can be matched by the order of the entries of an $E$. This is because the probit and ordered
probit models we are considering include threshold variables $\{\mu_y : y \in \mathcal{Y}\}$ which can be adjusted to
accommodate monotone transformations of $\alpha(u_i, u_j)$. With this in mind, note that the matrix of
squared distances among a set of $K$-dimensional vectors $\{z_1, \ldots, z_n\}$ is a monotonic transformation
of the distances, is of rank $K + 2$ or less (as $D^2 = [z_1^T z_1, \ldots, z_n^T z_n]^T \mathbf{1}^T + \mathbf{1} [z_1^T z_1, \ldots, z_n^T z_n] - 2 Z Z^T$)
and so is in $\mathcal{E}_{K+2}$. Furthermore, letting $u_i = (z_i, \sqrt{r^2 - z_i^T z_i}) \in \mathbb{R}^{K+1}$ for each $i \in \{1, \ldots, n\}$, we
have $u_i^T u_j = z_i^T z_j + \sqrt{(r^2 - |z_i|^2)(r^2 - |z_j|^2)}$. For large $r$ this is approximately $r^2 - |z_i - z_j|^2/2$,
which is an increasing function of the negative distance $d_{i,j}$. For large enough $r$ the numerical order
of the entries of this $E \in \mathcal{E}_{K+1}$ is the same as that of $D \in \mathcal{D}_K$.
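The construction $u_i = (z_i, \sqrt{r^2 - z_i^T z_i})$ can be tried on a small example. In the sketch below the positions `Z` and the value `r = 100` are hypothetical; the positions were picked so that all pairwise distances are distinct, which makes the two orderings directly comparable.

```python
import numpy as np

Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [4.0, 3.0], [-3.0, -5.0]])  # hypothetical, K = 2
n = Z.shape[0]
r = 100.0                                    # needs r^2 >= z_i^T z_i for every i

diff = Z[:, None, :] - Z[None, :, :]
D = -np.sqrt((diff ** 2).sum(axis=2))        # d_ij = -|z_i - z_j|

sq_norm = (Z ** 2).sum(axis=1)
U = np.column_stack([Z, np.sqrt(r ** 2 - sq_norm)])   # u_i in R^{K+1}
E = U @ U.T                                  # eigenmodel entries with Lambda = identity

iu = np.triu_indices(n, k=1)                 # pairs with i < j
same_order = np.array_equal(np.argsort(D[iu]), np.argsort(E[iu]))
```

Here `same_order` compares the rank order of the off-diagonal entries of $D$ and $E$. How large $r$ must be depends on how close the distinct distances are to one another.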

$\mathcal{D}_K$ does not weakly generalize $\mathcal{E}_1$: Consider $E \in \mathcal{E}_1$ generated by $\Lambda = 1$, $u_1 = 1$ and
$u_i = r < 1$ for $i > 1$. Then $r = e_{1,i_1} = e_{1,i_2} > e_{i_1,i_2} = r^2$ for all $i_1, i_2 \ne 1$. For which $K$ is such an
ordering of the elements of $D \in \mathcal{D}_K$ possible? If $K = 1$ then such an ordering is possible only if
$n = 3$. For $K = 2$ such an ordering is possible for $n \le 6$. This is because the kissing number in $\mathbb{R}^2$,
or the number of non-overlapping spheres of unit radius that can simultaneously touch a central
sphere of unit radius, is 6. If we put node 1 at the center of the central sphere, and 6 nodes at
the centers of the 6 kissing spheres, then we have $d_{1,i_1} = d_{1,i_2} = d_{i_1,i_2}$ for all $i_1, i_2 \ne 1$. We can
only have $d_{1,i_1} = d_{1,i_2} > d_{i_1,i_2}$ if we remove one of the non-central spheres to allow for more room
between those remaining, leaving one central sphere plus five kissing spheres for a total of $n = 6$.
Increasing $n$ increases the necessary dimension of the Euclidean space, and so for any $K$ there are
$n$ and $E \in \mathcal{E}_1$ that have entry orderings that cannot be matched by those of any $D \in \mathcal{D}_K$.
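For $K = 2$ the counting argument can be illustrated with evenly spaced points. The sketch assumes the central node sits at the origin and the other $m$ nodes lie on the unit circle, so the required ordering holds only if every pair of outer nodes is more than distance 1 apart. Even spacing gives the largest possible minimum gap between points on a circle, so it is the most favorable placement. The helper name is hypothetical.

```python
import math

def min_outer_distance(m):
    """Smallest distance between m points spaced evenly on the unit circle."""
    return 2.0 * math.sin(math.pi / m)

# Compare the smallest outer-outer distance with the center-outer distance of 1.
for m in (5, 6, 7):
    print(m, min_outer_distance(m) > 1.0)
```

With five outer nodes the smallest gap exceeds 1, with six it equals 1 up to rounding, and with seven it falls below 1, in line with the limit of $n = 6$ nodes in total.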



A less general positive semi-definite version of the eigenmodel has been studied by Hoff [2005],
in which $\Lambda$ was taken to be the identity matrix. Such a model can weakly generalize a distance
model, but cannot generalize a latent class model, as the eigenvalues of a latent class model could
be negative.



3 Model comparison on three different datasets 

3.1 Parameter estimation 

Bayesian parameter estimation for the three models under consideration can be achieved via Markov
chain Monte Carlo (MCMC) algorithms, in which posterior distributions for the unknown
quantities are approximated with empirical distributions of samples from a Markov chain. For these
algorithms, it is useful to formulate the probit models described in Section 2.1 in terms of an
additional latent variable $z_{i,j} \sim \text{normal}[\beta^T x_{i,j} + \alpha(u_i, u_j)]$, for which $y_{i,j} = y$ if $\mu_y < z_{i,j} < \mu_{y+1}$. Using
conjugate prior distributions where possible, the MCMC algorithms proceed by generating a new
state $\phi^{(s+1)} = \{Z^{(s+1)}, \mu^{(s+1)}, \beta^{(s+1)}, u_1^{(s+1)}, \ldots, u_n^{(s+1)}\}$ from a current state $\phi^{(s)}$ as follows:

1. For each $\{i, j\}$, sample $z_{i,j}$ from its (constrained normal) full conditional distribution.

2. For each $y \in \mathcal{Y}$, sample $\mu_y$ from its (normal) full conditional distribution.

3. Sample $\beta$ from its (multivariate normal) full conditional distribution.

4. Sample $u_1, \ldots, u_n$ and their associated parameters:

• For the latent distance model, propose and accept or reject new values of the $u_i$'s with
the Metropolis algorithm, and then sample the population variances of the $u_i$'s from
their (inverse-gamma) full conditional distributions.

• For the latent class model, update each class variable $u_i$ from its (multinomial)
conditional distribution given current values of $Z$, $\{u_j : j \ne i\}$ and the variance of the elements
of $M$ (but marginally over $M$ to improve mixing). Then sample the elements of $M$ from
their (normal) full conditional distributions and the variance of the entries of $M$ from
its (inverse-gamma) full conditional distribution.

• For the latent eigenmodel, sample each $u_i$ from its (multivariate normal) full
conditional distribution, sample the mean of the $u_i$'s from their (normal) full conditional
distributions, and then sample $\Lambda$ from its (multivariate normal) full conditional
distribution.
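Step 1 draws each $z_{i,j}$ from a normal distribution restricted to the interval implied by the observed $y_{i,j}$. The text does not say how this draw is carried out; the sketch below uses inverse-CDF sampling as one common option, with unit variance (matching the normal(0, 1) errors of Section 2.1) and a hypothetical threshold of 0 for the binary case. It is not taken from the eigenmodel package.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def sample_constrained_normal(mean, lower, upper, rng):
    """Draw z ~ normal(mean, 1) restricted to lower < z < upper by inverting the CDF."""
    p_lo = STD_NORMAL.cdf(lower - mean)
    p_hi = STD_NORMAL.cdf(upper - mean)
    p = rng.uniform(p_lo, p_hi)
    p = min(max(p, 1e-12), 1.0 - 1e-12)       # keep inv_cdf away from 0 and 1
    return mean + STD_NORMAL.inv_cdf(p)

rng = random.Random(1)
# Binary case with a hypothetical threshold at 0: y = 1 gives z in (0, inf), y = 0 gives z in (-inf, 0).
z_pos = [sample_constrained_normal(-0.3, 0.0, float("inf"), rng) for _ in range(1000)]
z_neg = [sample_constrained_normal(-0.3, float("-inf"), 0.0, rng) for _ in range(1000)]
```

Inverse-CDF sampling is simple but can lose accuracy when the interval lies far in a tail of the distribution; dedicated truncated-normal samplers are usually preferred in that case.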

To facilitate comparison across models, we used prior distributions in which the level of prior
variability in $\alpha(u_i, u_j)$ was similar across the three different models. An R package that implements
the MCMC is available at cran.r-project.org/src/contrib/Descriptions/eigenmodel.html.



3.2 Cross validation 

Table 1: Cross validation results and area under the ROC curves.

          Add health              Genesis                 Protein interaction
 K     dist   class  eigen     dist   class  eigen     dist   class  eigen
 3     0.82   0.64   0.75      0.62   0.82   0.82      0.83   0.79   0.88
 5     0.81   0.70   0.78      0.66   0.82   0.82      0.84   0.84   0.90
10     0.76   0.69   0.80      0.74   0.82   0.82      0.85   0.86   0.90

To compare the performance of these three different models we evaluated their out-of-sample
predictive performance under a range of dimensions ($K \in \{3, 5, 10\}$) and on three different datasets
exhibiting varying combinations of homophily and stochastic equivalence. For each combination of
dataset, dimension and model we performed a five-fold cross validation experiment as follows:

1. Randomly divide the $\binom{n}{2}$ data values into 5 sets of roughly equal size, letting $s_{i,j}$ be the set
to which pair $\{i, j\}$ is assigned.

2. For each $s \in \{1, \ldots, 5\}$:

(a) Obtain posterior distributions of the model parameters conditional on $\{y_{i,j} : s_{i,j} \ne s\}$,
the data on pairs not in set $s$.

(b) For pairs $\{k, l\}$ in set $s$, let $\hat{y}_{k,l} = E[y_{k,l} \mid \{y_{i,j} : s_{i,j} \ne s\}]$, the posterior predictive mean
of $y_{k,l}$ obtained using data not in set $s$.

This procedure generates a sociomatrix $\hat{Y}$, in which each entry $\hat{y}_{i,j}$ represents a predicted value
obtained from using a subset of the data that does not include $y_{i,j}$. Thus $\hat{Y}$ is a sociomatrix of
out-of-sample predictions of the observed data $Y$.
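The cross validation procedure above can be sketched in Python as follows. Folds are numbered 0 to 4 instead of 1 to 5, and `fit_predict` is a hypothetical callable standing in for any of the three fitted models. The `mean_predictor` used here is only a placeholder, so the example runs without an MCMC implementation.

```python
import numpy as np

def assign_folds(n, n_folds, rng):
    """Symmetric n x n array S with S[i, j] in 0..n_folds-1 for i != j and -1 on the diagonal."""
    iu = np.triu_indices(n, k=1)
    n_pairs = len(iu[0])                                     # n choose 2 pairs
    labels = rng.permutation(np.arange(n_pairs) % n_folds)   # fold sizes differ by at most one
    S = np.full((n, n), -1, dtype=int)
    S[iu] = labels
    S[iu[1], iu[0]] = labels                                 # mirror into the lower triangle
    return S

def cross_validate(Y, n_folds, fit_predict, rng):
    """Assemble out-of-sample predictions; fit_predict(Y_train) returns an n x n array."""
    n = Y.shape[0]
    S = assign_folds(n, n_folds, rng)
    Y_hat = np.full((n, n), np.nan)
    for s in range(n_folds):
        held_out = (S == s)
        Y_train = Y.astype(float)                            # astype returns a copy
        Y_train[held_out] = np.nan                           # hide the pairs in fold s
        pred = fit_predict(Y_train)
        Y_hat[held_out] = pred[held_out]
    return Y_hat, S

def mean_predictor(Y_train):
    """Placeholder model: predict every pair by the mean of the observed training pairs."""
    return np.full(Y_train.shape, np.nanmean(Y_train))

rng = np.random.default_rng(0)
n = 10
A = rng.integers(0, 2, size=(n, n))
Y = np.triu(A, k=1)
Y = (Y + Y.T).astype(float)                                  # symmetric binary sociomatrix
np.fill_diagonal(Y, np.nan)                                  # undefined diagonal
Y_hat, S = cross_validate(Y, 5, mean_predictor, rng)
```

Assigning folds to pairs rather than to nodes mirrors step 1, where the unit being held out is the pair $\{i, j\}$.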



3.3 Adolescent Health social network 

The first dataset records friendship ties among 247 12th-graders, obtained from the National
Longitudinal Study of Adolescent Health (www.cpc.unc.edu/projects/addhealth). For these data, $y_{i,j} = 1$
or 0 depending on whether or not there is a close friendship tie between student $i$ and $j$ (as reported
by either $i$ or $j$). These data are represented as an undirected graph in the first panel of Figure
2. Like many social networks, these data exhibit a good deal of transitivity. It is therefore not
surprising that the best performing models considered (in terms of area under the ROC curve,
given in Table 1) are the distance models, with the eigenmodels close behind. In contrast, the
latent class models perform poorly, and the results suggest that increasing $K$ for this model would
not improve its performance.

Figure 2: Social network data and unscaled ROC curves for the K = 3 models.



3.4 Word neighbors in Genesis 

The second dataset we consider is derived from word and punctuation counts in the first chapter of
the King James version of Genesis (www.gutenberg.org/dirs/etext05/bib0110.txt). There are 158
unique words and punctuation marks in this chapter, and for our example we take $y_{i,j}$ to be the
number of times that word $i$ and word $j$ appear next to each other (a model extension, appropriate
for an asymmetric version of this dataset, is discussed in the next section). These data can be
viewed as a graph with weighted edges, the unweighted version of which is shown in the first panel
of Figure 3. The lack of a clear spatial representation of these data is not unexpected, as text data
such as these do not have groups of words with strong within-group connections, nor do they display
much homophily: a given noun may appear quite frequently next to two different verbs, but these
verbs will not appear next to each other. A better description of these data might be that there
are classes of words, and connections occur between words of different classes. The cross validation
results support this claim, in that the latent class model performs much better than the distance
model on these data, as seen in the second panel of Figure 3 and in Table 1. As discussed in the
previous section, the eigenmodel generalizes the latent class model and performs equally well. We
note that parameter estimates for these data were obtained using the ordered probit versions of the
models (as the data are not binary), but the out-of-sample predictive performance was evaluated
based on each model's ability to predict a non-zero relationship.

Figure 3: Relational text data from Genesis and unscaled ROC curves for the K = 3 models.
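The evaluation criterion reported in Table 1 can be computed from a matrix of out-of-sample predictions in a few lines. The sketch below uses the standard pairwise-comparison form of the area under the ROC curve and, for count data, scores the prediction of a non-zero relationship as described above. The function names and the small example matrices are hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve as P(score_pos > score_neg) + 0.5 * P(tie)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def auc_sociomatrix(Y, Y_hat):
    """AUC over pairs i < j for predicting a non-zero y_ij from the predicted values."""
    iu = np.triu_indices(Y.shape[0], k=1)
    return auc(Y_hat[iu], Y[iu] > 0)

# Hypothetical count-valued sociomatrix and out-of-sample predictions.
Y = np.array([[0, 2, 0],
              [2, 0, 1],
              [0, 1, 0]])
Y_hat = np.array([[0.0, 0.9, 0.2],
                  [0.9, 0.0, 0.6],
                  [0.2, 0.6, 0.0]])
area = auc_sociomatrix(Y, Y_hat)
```

The pairwise form compares every positive pair with every negative pair, which is adequate for networks of a few hundred nodes; rank-based formulas scale better for larger ones.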



3.5 Protein-protein interaction data 



Our last example is the protein-protein interaction data of Butland et al. [2005], in which $y_{i,j} = 1$
if proteins $i$ and $j$ bind and $y_{i,j} = 0$ otherwise. We analyze the large connected component of this
graph, which includes 230 proteins and is displayed in the first panel of Figure 4. This graph indicates
patterns of both stochastic equivalence and homophily: Some nodes could be described as "hubs",
connecting to many other nodes which in turn do not connect to each other. Such structure is
better represented by a latent class model than a distance model. However, most nodes connecting
to hubs generally connect to only one hub, which is a feature that is hard to represent with a small
number of latent classes. To represent this structure well, we would need two latent classes per
hub, one for the hub itself and one for the nodes connecting to the hub. Furthermore, the core of
the network (the nodes with more than two connections) displays a good degree of homophily in
the form of transitive triads, a feature which is easiest to represent with a distance model. The
eigenmodel is able to capture both of these data features and performs better than the other two
models in terms of out-of-sample predictive performance. In fact, the $K = 3$ eigenmodel performs
better than the other two models for any value of $K$ considered.



4 Discussion 

Latent distance and latent class models provide concise, easily interpreted descriptions of social
networks and relational data. However, neither of these models will provide a complete picture
of relational data that exhibit degrees of both homophily and stochastic equivalence. In contrast,
we have shown that a latent eigenmodel is able to represent datasets with either or both of these
data patterns. This is due to the fact that the eigenmodel provides an unrestricted low-rank
approximation to the sociomatrix, and is therefore able to represent a wide array of patterns in the
data.

Figure 4: Protein-protein interaction data and unscaled ROC curves for the K = 3 models.

The concept behind the eigenmodel is the familiar eigenvalue decomposition of a symmetric
matrix. The analogue for directed networks or rectangular matrix data would be a model based
on the singular value decomposition, in which data $y_{i,j}$ could be modeled as depending on $u_i^T D v_j$,
where $u_i$ and $v_j$ represent vectors of latent row and column effects respectively. Statistical inference
using the singular value decomposition for Gaussian data is straightforward. A model-based version
of the approach for binary and other non-Gaussian relational datasets could be implemented using
the ordered probit model discussed in this paper.
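The form $u_i^T D v_j$ can be made concrete with a truncated singular value decomposition. The sketch below is a plain least-squares illustration on a hypothetical Gaussian matrix, not the model-based procedure suggested above; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 4))                    # hypothetical rectangular data matrix
K = 2

U, d, Vt = np.linalg.svd(Y, full_matrices=False)
U_K, d_K, V_K = U[:, :K], d[:K], Vt[:K, :].T   # row factors u_i, diagonal of D, column factors v_j

Y_K = (U_K * d_K) @ V_K.T                      # entry (i, j) equals u_i^T D v_j
frob_err = np.linalg.norm(Y - Y_K)             # Frobenius norm of the residual
```

The residual norm equals the square root of the sum of the squared discarded singular values, which is the sense in which the truncated decomposition is the best rank-$K$ approximation in least squares.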